We want to know gene-phenotype correlations. This is the oldest and most central topic of genetics research. But in most cases, it’s difficult to associate an 1-to-1 relationship between gene and phenotype, except for some rare genetic disorders that occur in early childhood. However, transcript expression, although not visible to the naked eye, can be seen as a form of molecular phenotype for which to establish a correlation with DNA variants.
eQTL does just that. It describes the correlation between DNA variants and transcript expression level.
The following figure is a simplified overview of eQTL.
eQTL itself means the genomic region that contains sequence variants that affect another gene’s expression level. It’s similar to QTL in that it links phenotypic data with genotype data and looks for correlation between them. But in the case of eQTL, expression level is observed rather than the genotypic variation.
Well, the thing is, expression level is the mediating variable between genotype and phenotype. So it can help us gain insight and new interpretations in genomic data analysis. In this exercise, we learn various methods used in clustering anlaysis and use various eQTL analytical tools to reach an interpretation and conclusion of our own.
Microarray data analysis: 1) preprocessing, 2) normalization, 3) differentially 4) expressed gene detection, 5) gene expression pattern analysis, 6) clustering analysis, 7) classification analysis, 8) biological annotations of the analyzed results, 9) and functional inference.
Clustering analysis can be viewed as a type of Unsupervised machine learning for pattern analysis or similarity structure analysis. Simply put, it’s the most primitive form of learning that clusters like things and takes apart not-like things. This approach is based on the expectation that these clusters to show specific pattern that can be distinguished from one another.
Clustering can be divided into two big categories:
Hierarchical clustering
Partitional clustering
There are also variations like SOM (or SOFM Self-Organizing Feature Map), and dimensionality reduction algorithms like PCA (principal component analysis) can be also called clustering analysis.
QTL (quantitative trait loci) method from traditional statistical genetics have evolved into eQTL (expression QTL) analytic method. QTL means certain regions in the genome that show correlation with certain phenotypic quantitative trait variation. So the good ole’ days’ Mendelian genetics explored the relationship between genotype and phenotype. eQTL takes account of the probable assumption that transcripts’ expression level regulation mediates genotype and phenotype. This was to address the problem that genotypes cannot fully explain genotype-phenotype relationship. It clearly wasn’t the full picture. It made a progress from the common notion that certain genotype determines a specific phenotype and proclaimed that there are indeed cases where certain genotypes can affect expression levels of other genes. Such regulation in expression level in turn manifests itself as a phenotype with complex interactions. So the expression level of a gene is defined as eQT (expression quantitative trait) and we look for the genomic region (loci) that are correlated with this variation in phenotype. Omics-level DNA RNA analysis enabled us to analyze the correlation between eQT and DNA position (L): eQT-L correlation \(\rightarrow\) And this, ladies and gentlemen, is eQTL.
Some even call the expression level as a mediating phenotype (intermediate phenotype) as molecular phenotype. There have been some accounts where candidate genomic regions discovered through GWAS were not protein-coding region. This made it difficult to infer and annotate the biological functional annotation. It’s also a possibility that some of them are actually eQTLs that regulate expression level of other genes. So eQTL begins with the simplistic assumption that variation in non-coding regions are probably variants that are related to gene expression regulation. These mechanism of action of such eQTL can either be cis-acting or trans-acting. Cis-eQTL implies that a nearby genetic variation in regions like promoter can directly and adjacently affects gene expression. Trans-eQTL is when the genomic variant’s product functions as an intermediary that indirectly regulates the gene expression of a second gene. Although it can be confounding, the terms cis and trans do not simply pertain to the classification based on the physical distance, as in “local” and “distant.”
So here’s the overall workflow of eQTL analysis exercise.
We have breast cancer sample x10 and normal sample x10, which are captured by Agilent chip WES. The sample data are microarray gene expression data of breast cancer. There are 10 samples for patient group (BC1~BC10) and 10 samples for normal control group (NB1~NB10). Now there are total of 159 genes, and the data is downloaded from breast cancer data from TCGA. The platform is Agilent Expression 244K microarrays and here’s how the data were preprocessed; it went through 1) Lowess normalization, 2) Gene level summarize (averaging probe expression value, 3) T-test (patient group vs. normal group), 4) differentially expressed genes selected (FDR adjusted p-value < 0.005). We’re gonna t-test over multiple genes and identify differentially expressed genes (159 genes, to be exact) by adjusted p-value of less than 0.005 (<0.005). This data is passed onto clustering analysis and eQTL resources. In clustering analysis, samples are clustered if they share certain expression features, by hierarchical clustering and PCA. Likewise, genes are also clustered either by K-means algorithm that clusters a given dataset into k-number of clusters.It’s a form of unsupervised machine learning that operates by minimizing the overall dispersion among the clusters.
In this exercise, we use packages som and ggplot2. Install packages by following code.
Since it wasn’t present on bioconductor, I downloaded som from cran.1
Before using som, let’s learn what the “Self-Organizing Map” is all about. Here’s a nicely written blog article on the subject. I will paraphrase this.
So it randomly positions the grid’s neurons in the data space.
Selects one data point either randomly or systematically cycling thorugh the dataset in order.
Finds the neuron closest to the chosen data point. This neuron is called the “Best matching unit” (BMU).
Now, move the BMU closer to that data point. The distance moved by the BMU is determined by a certain “learning rate”, which decreases after each iteration.
Also, the algorithm moves BMU’s neighbors themselves closer to that data point as well, with farther away neighbors moving less.
Neighbors are identified using a radius around the BMU, and the value for this radius decreases after each iteration.
Update the learning rate and BMU radius, and repeat steps 1~4. Iterate these steps until positions of neurons have been stabilized.
So overall, the grid takes a distorted form with its unit neurons getting densely concentrated around actual clusters.
Load the libraries.
library(som)
library(ggplot2)
20 people’s samples, 159 genes’ expression profiles are stored in ~/eQTL/breast.txt. Each row indicates each gene and each row corresponds to each sample. The elements contain expression level profiles. Now, load the data using read.delim. It’s specifically for reading tab-delimited files into tables.
getwd()
## [1] "/home/seungho/GDA/eQTL"
brc=read.delim("breast.txt", header=TRUE, row.names=1)
brc= as.data.frame(brc)
head(brc)
## BC1 BC2 BC3 BC4 BC5 BC6
## MMP11 4.2066667 3.5826667 1.415600000 2.25550000 1.756000 3.3433330
## CDCA2 -2.1091667 -1.6561667 -2.623000000 -1.63600000 -3.151333 -2.7011670
## ASPM -1.7484545 -2.7056429 -2.551785714 -1.71257143 -3.474769 -1.9912140
## TMEM88 -0.2888333 -0.5121667 -0.008333333 0.01033333 0.458500 0.4891667
## KIAA0101 -1.7595000 -1.4788000 -1.332100000 -1.42540000 -3.424600 -2.1232000
## ZDHHC9 0.3760000 0.6685000 -0.022500000 0.63900000 0.544000 0.8480000
## BC7 BC8 BC9 BC10 NB1 NB2
## MMP11 1.394833 2.9648330 3.7255000 2.407333 -0.7341667 -0.4866667
## CDCA2 -2.021167 -2.6096670 -2.8086667 -1.380667 -3.6373333 -3.4475000
## ASPM -1.974786 -2.3585710 -2.2496154 -1.349071 -3.7272143 -4.1337860
## TMEM88 0.870500 0.1896667 0.4393333 0.088000 1.7673333 1.5371670
## KIAA0101 -1.501900 -1.4268000 -1.6194000 -1.846900 -3.3982222 -3.6480000
## ZDHHC9 0.278500 0.8340000 0.5840000 1.016000 -0.0765000 0.1660000
## NB3 NB4 NB5 NB6 NB7 NB8
## MMP11 -0.6361667 -0.9161667 1.865000 -0.8198333 -0.1163333 0.1826667
## CDCA2 -4.1261667 -4.2118330 -3.456167 -3.4940000 -3.7546667 -2.9301667
## ASPM -4.1231429 -4.5689290 -4.304750 -4.2784290 -3.4375000 -2.4461429
## TMEM88 1.2563333 1.5588330 1.798000 1.1176670 1.5915000 2.3540000
## KIAA0101 -4.3877000 -4.3592000 -3.280700 -3.8828890 -3.7111250 -1.8986667
## ZDHHC9 -0.2240000 -0.2230000 -0.927000 -0.5365000 -0.5605000 -0.4435000
## NB9 NB10
## MMP11 -1.070000 -0.734000
## CDCA2 -3.994833 -3.899667
## ASPM -5.107429 -4.921643
## TMEM88 2.695833 0.548000
## KIAA0101 -4.711125 -3.890600
## ZDHHC9 0.163500 -0.216000
Microarray data has two axes: samples and genes. Consequently clustering can occur both ways: sample clustering and gene clustering. Gene clustering is when you cluster together the genes that have similar expression patterns over all the samples; and genes that are clustered together are tightly co-regulated genes under the samples’ conditions, having similar functionalities. Also we can cluster samples into clusters of samples that have similar expression patterns. For example, the exercise data may conveniently and expectedly be clustered into two clusters, 10 of breast cancer patients’ and 10 of normal control patients. It’s also possible to hierarchically cluster by both gene and samples, in two axes.
Hierarchical clustering is the most widely used method of clustering in microarray clustering analysis. There are two types: divisive and agglomerative. Usually agglomerative hierarchical clustering analysis is used, in which similarity between two individual subjects is measured and they are grouped together, up and up. Hierarchical clustering analysis will eventually cluster the entire dataset together as one big hierarchical tree or dendrogram, and this means we don’t have to select the target cluster number as in k-means or SOM. But we still have to determine the determining threshold to divide the clusters at some point.
Next example is a simple example of performing a hierarchical clustering analysis. We can set various variables and the most representative are distance function and linkage variable.
Distance functions:
Different distance functions can be used to different hierarchical clusters. It’s actually a very important problem to select the proper distance function that fits the purpose of a particular clustering analysis model, but it’s a bit over the scope of this practice so we’ll not cover this topic in depth here. Refer to this article. Default distance measure is the Euclidean distance. Correlation-based distance is often used in gene expression data analysis. Correlation-based distance considers two objects to be similar if their features are highly correlated, even though the observed values may be far apart in terms of Euclidean distance.
Agglomerative clustering also can be divided into simple, complete, and average linkage, and more. Complete agglomerative usually results in better performance, but more computation is required. In this practice we perform the default, simple agglomerative hierarchical clustering analysis. Understanding the Concept of Hierarchical Clustering Technique
dst = dist(t(brc)) # Compute distance matrix. It's included in stats.
hc = hclust(dst) # Hierarchical clustering performed here. Also in stats.
plot(hc) # Draw the dendrogram
sc = c(rep("skyblue", 10), rep("pink", 10)) # Sample color settings
heatmap(as.matrix(brc), col=cm.colors(256), ColSideColors=sc) # Draw heatmap.
Diverse mapping and dimensionality reduction techniques including PCA can also be used for clustering. PCA is a dimension reduction technique that minimizes information loss and discovers principal component axis that best represents the data. In genomic data analysis, PCA axis is composed of linearly combined information of many genes, and this is why it’s also called “supergene” or “metagene”, in that it supersedes the dimension of just one gene.
Perform PCA analysis with the following code block. PCA also requires various variable and argument settings and these affect the outcome greatly, but such oversteps the scope and purpose of this practice, so we will not discuss this in detail.
p=prcomp(t(brc)) # Perform PCA
sc = c(rep("slateblue", 10), rep("hotpink", 10)) # Choose sample color
la = c(rep("C", 10), rep("N", 10)) # Choose sample names
plot(p$x[,1:2], col=sc, pch=8) # Draw the result plot of PCA analysis.
# Only take the PC1 and PC2.
p$x[,1:2]
## PC1 PC2
## BC1 -16.908535 0.8978444
## BC2 -16.114723 -2.0015157
## BC3 -19.414552 0.3805945
## BC4 -15.316779 0.8517747
## BC5 -3.596877 -2.2965054
## BC6 -9.308695 -2.7698020
## BC7 -4.825497 -1.2473441
## BC8 -15.261241 -1.3985995
## BC9 -18.195990 -2.6287455
## BC10 -17.919862 3.1176076
## NB1 11.696798 2.9042919
## NB2 13.723050 1.1911008
## NB3 16.453000 -1.9473951
## NB4 19.145211 -4.0321514
## NB5 14.436177 -0.4071183
## NB6 6.060183 1.0581736
## NB7 18.317468 1.9470519
## NB8 5.640119 11.6303669
## NB9 21.936732 -3.4652929
## NB10 9.454012 -1.7843363
plot(p$x[,1:2], col=sc, pch=8) # Draw the result plot of PCA analysis.
# Only take the PC1 and PC2.
text(x=p$x[,1], y=p$x[,2], labels=la, adj=-0.5)
summary(p)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 15.0622 3.43518 3.06786 2.73188 2.1508 2.08678 1.71096
## Proportion of Variance 0.7946 0.04133 0.03296 0.02614 0.0162 0.01525 0.01025
## Cumulative Proportion 0.7946 0.83591 0.86888 0.89501 0.9112 0.92647 0.93672
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 1.64618 1.52002 1.4733 1.38106 1.26834 1.23081 1.15672
## Proportion of Variance 0.00949 0.00809 0.0076 0.00668 0.00563 0.00531 0.00469
## Cumulative Proportion 0.94621 0.95430 0.9619 0.96859 0.97422 0.97953 0.98421
## PC15 PC16 PC17 PC18 PC19 PC20
## Standard deviation 1.07904 1.01909 0.97974 0.86651 0.77090 2.225e-15
## Proportion of Variance 0.00408 0.00364 0.00336 0.00263 0.00208 0.000e+00
## Cumulative Proportion 0.98829 0.99193 0.99529 0.99792 1.00000 1.000e+00
k-means clustering analysis is one of the oldest clustering techniques. Numerous, actually thousands of improved clustering methods have been proposed, but k-means is still one of the favorite and most powerful techniques used in big data analysis because of its simplicity and following computational speed. Required variable in k-means clustering is, of course, k. K designates how many groups you’re gonna cluster your data into. K-means first randomly selects corresponding number of centroids in the data and gradually approaches the optimal distribution.
So in the following figure, 1) you can see that three means (k=3) are randomly initialized in the data. 2) And then the algorithm divides the exploratory space into three parts, segregating closest data points to a randomized mean, which in turn results in candidate clusters (expectation). 3) Then the means of each candidate clusters are taken and updated as the new means. (maximization) 4) Again, the exploratory spaces are divided into three parts (expectation), 5) and new candidate clusters again result in updated mean (maximization). These steps are iterated until the means converge towards to optimized approximation.
If you have got some time, try getting the optimal K, which is a fairly complicated problem that easily exceeds the scope of this exercise and also requires special individual attention to each data set. Next code clusters the data into 10 groups.
km = kmeans(brc, centers = 10, iter.max = 1000) # Perform K-means.
res = cbind(km$cluster, brc) # Get the results.
res[1:5, 1:5] # View the results.
## km$cluster BC1 BC2 BC3 BC4
## MMP11 10 4.2066667 3.5826667 1.415600000 2.25550000
## CDCA2 4 -2.1091667 -1.6561667 -2.623000000 -1.63600000
## ASPM 4 -1.7484545 -2.7056429 -2.551785714 -1.71257143
## TMEM88 7 -0.2888333 -0.5121667 -0.008333333 0.01033333
## KIAA0101 4 -1.7595000 -1.4788000 -1.332100000 -1.42540000
What does km look like?
head(km)
## $cluster
## MMP11 CDCA2 ASPM TMEM88 KIAA0101 ZDHHC9 TPM3 PHF6
## 10 4 4 7 4 10 6 6
## TNFRSF18 JPH2 GPD1 MFSD5 ZDHHC24 GSN ZNF553 FAM128A
## 6 7 1 10 6 1 10 6
## C10orf90 TPX2 MANEAL UBE2Z PARP1 ROBO3 FHL1 EGR1
## 9 4 2 6 2 9 3 1
## CKS2 C20orf96 S100B RDH5 NUP210 IGSF10 CEP55 ANXA1
## 4 6 3 8 2 5 4 7
## ESCO2 TRIM59 LGALS12 RAB26 IRF9 CDCA5 AMT RGN
## 4 6 8 6 10 4 9 9
## SOD3 MME HIST1H3J KCNE1 STARD9 CYR61 STMN1 GLYAT
## 7 8 10 1 9 8 2 1
## PLEKHH2 PPAP2B CLCN2 PRR15 FGFR1OP ANKDD1A ETS2 ABHD11
## 9 9 6 10 6 9 9 10
## CAV1 CES1 SMOC1 CLCN6 AQP7 UBE2T MBOAT2 PTX3
## 3 3 5 9 8 4 10 5
## TOP2A TMEM47 PLP1 PTPN6 NACAD ACADL NLK CASP12
## 4 7 5 10 7 1 6 7
## AKR1C1 CLEC4G TK1 PTGS2 LIPE FGF7 C18orf45 ENTPD7
## 3 7 4 5 1 1 10 10
## BRI3BP CORO2B SORD RBP4 TP53BP1 GPX3 ZNHIT2 NCBP1
## 6 9 10 3 6 7 10 6
## AVPR2 KALRN ACVR1C NEK2 UBAP2L BUB1 NUDT16L1 MLF1IP
## 7 9 1 4 6 4 10 2
## GPR109B C13orf3 LRRIQ2 LIG3 FAM70A CX3CL1 C14orf80 WDR51A
## 8 4 6 6 9 1 2 2
## DOPEY2 NME3 ATP6AP1 PPP1R15A ATIC AQP7P2 NR4A1 STRBP
## 10 10 10 9 6 1 7 6
## CLEC3B C20orf94 SOX12 CDAN1 CXCL3 PLAGL1 SGCG PRC1
## 1 10 6 6 5 9 3 4
## GPC3 KIAA1683 NTF3 POLA1 TACC3 ZWINT CNTNAP3 KLF4
## 5 9 9 6 4 2 5 7
## IGFBP6 VIP DUSP1 HDGF CCDC88C HN1 ZBTB9 SLC22A3
## 9 7 1 2 10 2 10 9
## CSTF2 C3orf41 LOC387911 CXCL2 NUAK2 CA4 ZFP36 FXYD1
## 2 5 1 3 6 3 1 8
## PGM5 DPP6 COL10A1 SPRY2 CCBE1 LYVE1 KGFLP1 FOSB
## 7 9 10 9 9 1 1 1
## FOS EFCBP1 FANCI IRS2 MKI67 NMUR1 KIF4A TNNT3
## 8 5 4 7 4 8 4 9
## SLC4A7
## 2
##
## $centers
## BC1 BC2 BC3 BC4 BC5 BC6
## 1 0.363075000 0.9869296 0.23614907 0.87863565 1.4158222 1.23847828
## 2 -0.724200000 -0.7835421 -0.94208102 -0.81419630 -1.5894458 -1.06205919
## 3 -2.204748148 -2.5277944 -3.06008148 -2.16313889 -0.9402185 -1.08809074
## 4 -1.852155502 -2.0983106 -2.10368331 -1.63737203 -3.7455969 -2.62728063
## 5 -3.377546591 -4.0980575 -3.95562440 -3.17998167 -2.2623850 -2.90449920
## 6 -0.007212784 0.1894422 0.07867242 -0.03569387 -0.4356375 -0.12348621
## 7 -0.110435000 -0.2310828 -0.45856778 -0.13049589 0.3694478 -0.01652166
## 8 -0.481019444 -0.3413380 -1.43108333 -0.96151852 0.4436991 0.49738884
## 9 -1.326689406 -1.2134337 -1.66722945 -1.16984229 -0.2185684 -0.88804788
## 10 1.513416099 1.5491164 1.17330046 1.31984851 0.7771399 1.16591625
## BC7 BC8 BC9 BC10 NB1 NB2
## 1 2.3754676 0.77838474 0.3664718 0.63567136 3.7768426 3.67110928
## 2 -1.0056667 -0.87194581 -0.5976787 -0.45527542 -1.9828282 -2.04766867
## 3 -0.5671241 -2.30486122 -2.3545204 -2.24516844 1.0173500 1.57618886
## 4 -2.0911202 -2.07391503 -1.9828224 -1.16445242 -3.9672320 -4.07285784
## 5 -2.4772642 -3.79835630 -3.9719008 -3.87228500 -0.9562992 -0.46185125
## 6 -0.1178435 0.01534143 0.1339313 -0.09410268 -1.3201885 -1.18306411
## 7 0.6450744 -0.00328444 -0.2867215 -0.28392443 1.7280300 1.82731847
## 8 1.4509954 -0.18187037 -0.6460648 -0.61855556 2.4313009 2.95579633
## 9 -0.8273715 -1.26545330 -1.4766850 -1.53029277 0.4324883 0.71435088
## 10 1.1597494 1.41231046 1.6016769 1.10991135 -0.1328168 -0.02361199
## NB3 NB4 NB5 NB6 NB7 NB8
## 1 4.1138588 4.1067658 4.0031185 2.482437389 4.3094292 2.6657120
## 2 -2.4380829 -2.6757083 -2.1182782 -1.990384667 -2.7001051 -1.9813028
## 3 1.9871889 2.6507316 2.4749870 0.252229633 2.0645222 0.3792880
## 4 -4.6324932 -5.0340579 -3.8072554 -4.215714053 -4.2643911 -2.6756837
## 5 -0.8652204 -0.7564237 -0.6359271 -1.474255000 -0.6872667 -0.9039658
## 6 -1.1153377 -1.1649309 -1.1552762 -0.981594208 -1.5380764 -1.3620886
## 7 1.8415822 1.7928395 1.8185856 1.048155560 2.0923167 2.1288344
## 8 3.2745370 4.0663658 3.2179352 1.535680667 3.6766250 1.5937176
## 9 0.7576912 0.5998011 0.5416594 0.354678565 0.7925300 0.3025057
## 10 -0.1412604 -0.2253308 0.1417405 0.003561614 -0.5913799 -0.3147723
## NB9 NB10
## 1 4.3920282 3.0184356
## 2 -2.6363352 -2.1397810
## 3 3.7779037 0.7588093
## 4 -4.8524136 -4.6368213
## 5 -1.1419429 -1.1497679
## 6 -1.3570257 -0.9942895
## 7 2.4197939 1.2651672
## 8 3.9657176 2.2765556
## 9 0.8723458 0.3607403
## 10 -0.4746480 0.1975832
##
## $totss
## [1] 13772.63
##
## $withinss
## [1] 291.90953 78.08482 161.22155 170.15470 237.03222 140.45566 135.53206
## [8] 88.60056 190.42980 409.19959
##
## $tot.withinss
## [1] 1902.62
##
## $betweenss
## [1] 11870.01
write.table(res, 'kmeans.txt', sep =" ", row.names = TRUE, col.names=TRUE) # Make result flie
km$cluster["MMP11"]
## MMP11
## 10
MMP11 gene belongs to cluster 1.
One can visualize how genes in the same cluster exhibit different expression patterns in breast cancer patient group and normal group.
brc’s columns 1 through 10 correspond to breast cancer samples and 11 through 20 correspond to normal samples.
c = apply(brc[,1:10], 1, mean) # Mean expression level in the breast cancer patient group
# for brc[, 1:10], row-wise, perform mean for each gene.
n = apply(brc[,11:20], 1, mean) # Mean expression level in the normal group
# for brc[, 11:20], row-wise, perform mean for each gene.
a = data.frame(c, n) # Data format conversion
cc = rainbow(10)[as.factor(km$cluster)] # Color coordination for each gene clusters.
plot(a, col=cc, pch=8) # pch=8 designates point symbol star.
Refer to this article for point symbol designation.
cc
## [1] "#FF0099FF" "#33FF00FF" "#33FF00FF" "#0066FFFF" "#33FF00FF" "#FF0099FF"
## [7] "#00FFFFFF" "#00FFFFFF" "#00FFFFFF" "#0066FFFF" "#FF0000FF" "#FF0099FF"
## [13] "#00FFFFFF" "#FF0000FF" "#FF0099FF" "#00FFFFFF" "#CC00FFFF" "#33FF00FF"
## [19] "#FF9900FF" "#00FFFFFF" "#FF9900FF" "#CC00FFFF" "#CCFF00FF" "#FF0000FF"
## [25] "#33FF00FF" "#00FFFFFF" "#CCFF00FF" "#3300FFFF" "#FF9900FF" "#00FF66FF"
## [31] "#33FF00FF" "#0066FFFF" "#33FF00FF" "#00FFFFFF" "#3300FFFF" "#00FFFFFF"
## [37] "#FF0099FF" "#33FF00FF" "#CC00FFFF" "#CC00FFFF" "#0066FFFF" "#3300FFFF"
## [43] "#FF0099FF" "#FF0000FF" "#CC00FFFF" "#3300FFFF" "#FF9900FF" "#FF0000FF"
## [49] "#CC00FFFF" "#CC00FFFF" "#00FFFFFF" "#FF0099FF" "#00FFFFFF" "#CC00FFFF"
## [55] "#CC00FFFF" "#FF0099FF" "#CCFF00FF" "#CCFF00FF" "#00FF66FF" "#CC00FFFF"
## [61] "#3300FFFF" "#33FF00FF" "#FF0099FF" "#00FF66FF" "#33FF00FF" "#0066FFFF"
## [67] "#00FF66FF" "#FF0099FF" "#0066FFFF" "#FF0000FF" "#00FFFFFF" "#0066FFFF"
## [73] "#CCFF00FF" "#0066FFFF" "#33FF00FF" "#00FF66FF" "#FF0000FF" "#FF0000FF"
## [79] "#FF0099FF" "#FF0099FF" "#00FFFFFF" "#CC00FFFF" "#FF0099FF" "#CCFF00FF"
## [85] "#00FFFFFF" "#0066FFFF" "#FF0099FF" "#00FFFFFF" "#0066FFFF" "#CC00FFFF"
## [91] "#FF0000FF" "#33FF00FF" "#00FFFFFF" "#33FF00FF" "#FF0099FF" "#FF9900FF"
## [97] "#3300FFFF" "#33FF00FF" "#00FFFFFF" "#00FFFFFF" "#CC00FFFF" "#FF0000FF"
## [103] "#FF9900FF" "#FF9900FF" "#FF0099FF" "#FF0099FF" "#FF0099FF" "#CC00FFFF"
## [109] "#00FFFFFF" "#FF0000FF" "#0066FFFF" "#00FFFFFF" "#FF0000FF" "#FF0099FF"
## [115] "#00FFFFFF" "#00FFFFFF" "#00FF66FF" "#CC00FFFF" "#CCFF00FF" "#33FF00FF"
## [121] "#00FF66FF" "#CC00FFFF" "#CC00FFFF" "#00FFFFFF" "#33FF00FF" "#FF9900FF"
## [127] "#00FF66FF" "#0066FFFF" "#CC00FFFF" "#0066FFFF" "#FF0000FF" "#FF9900FF"
## [133] "#FF0099FF" "#FF9900FF" "#FF0099FF" "#CC00FFFF" "#FF9900FF" "#00FF66FF"
## [139] "#FF0000FF" "#CCFF00FF" "#00FFFFFF" "#CCFF00FF" "#FF0000FF" "#3300FFFF"
## [145] "#0066FFFF" "#CC00FFFF" "#FF0099FF" "#CC00FFFF" "#CC00FFFF" "#FF0000FF"
## [151] "#FF0000FF" "#FF0000FF" "#3300FFFF" "#00FF66FF" "#33FF00FF" "#0066FFFF"
## [157] "#33FF00FF" "#3300FFFF" "#33FF00FF" "#CC00FFFF" "#FF9900FF"
It seems that clusters can be divided into five groups that are more inclined to be expressed in breast cancer and five groups that have tendency to appear in normal tissue. Therefore, it seems feasible to name each of the two big clusters as “breast cancer cluster” and “normal cluster”.
Use ggplot for better visualization.
p = ggplot(a, aes(x=c, y=n, color=as.factor(km$cluster)))
p = p + geom_point(shape=8)
p = p + labs(colour = "gene cluster")
p
Better aesthetics with ggplot.
Self-organizing map (SOM) is a clustering analysis using neural network models. K-means clustering analysis only divides the entire dataset into k number of clusters, but does not perform analysis of the relationship among the clusters. However, we’ve already seen from above example that, regardless of the arbitrary cluster number that we designate (k), there might be a more apparent underlying pattern that exists among the clusters. In the above case, there are k=10 clusters, but they can also be divided into two big clusters that correspond to ‘normal cell cluster’ and ‘breast cancer cluster’. Self-organizing map retains the characteristic of original lattice structure of neural networks, but also moves towards clustering. Therefore we can observe clustering itself while retaining information about the relationship between cluster to cluster.
Perform SOM analysis of 3 x 3 lattice structure.
sm = som(brc, xdim = 3, ydim = 3)
plot(sm)
The resulting figure is SOM clustering analysis result. One lattice signifies one cluster and neighboring clusters imply that they are more similar to each other than non-adjacent clusters. Number at the top of each lattice equals the number of genes in each cluster. Total of 159 genes have been clustered into 7 clusters and each cluster contains 8 to 40 genes per cluster. The graphs in each lattice represent gene expression profiles of samples that belong to that cluster. One is able to observe that neighboring lattices have similar looking expression profiles.
NCBI GTEx (Genotype-Tissue Expression) eQTL Browser and seeQTL (A searchable human eQTL browser and database) are most widely used eQTL databases. GTEx provides convenient filter searches with gene ID, dbSNP ID, phenotype traits, etc, which users can utilize to make a variety of queries.
Let’s go to NCBI GTEx and search the eQTL for breast cancer’s differentially expressed gene, TNNT3.
If you click on IGV Browser in Actions on the right-most column, you can access GTEx IGV Browser.
Then go to Show Track Menu and select Breast-Mammary Tissue.
Grey dots correspond to all the statistically significant gene-snp pairs that are incis eQTL relationship. Red dots show the variants that have FDR of less than 0.05 with TNNT3 gene expression.
You can see that rs2734495 is a significant eQTL to genes TNNT3 and MRPL23 in breast mammary tissue.